Parallel Sorted Neighborhood Blocking with MapReduce
Authors
Abstract
Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate the challenges of, and possible solutions for, using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations of Sorted Neighborhood blocking that either use multiple MapReduce jobs or apply tailored data replication.
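To make the blocking idea concrete, the following is a minimal single-process Python sketch of Sorted Neighborhood expressed in map/reduce terms: the map phase emits a blocking key per record, the shuffle/sort is simulated by sorting on that key, and the reduce phase slides a fixed window over the sorted records to produce candidate pairs. The record layout, key function, and window size are illustrative assumptions, and the handling of records near reduce-partition boundaries that the paper's two variants address is deliberately omitted.

```python
# Minimal single-process sketch of Sorted Neighborhood blocking in
# map/reduce terms. Record layout, blocking key, and window size are
# illustrative assumptions, not the paper's exact implementation.

def map_phase(record):
    # Emit (blocking key, record); here the key is an assumed function:
    # a surname prefix concatenated with a zip-code prefix.
    key = record["surname"][:3].lower() + record["zip"][:2]
    return key, record

def reduce_phase(sorted_records, window=3):
    # Slide a fixed-size window over the key-sorted records and
    # emit candidate pairs for detailed matching.
    for i, rec in enumerate(sorted_records):
        for other in sorted_records[i + 1 : i + window]:
            yield rec, other

records = [
    {"id": 1, "surname": "Smith",  "zip": "12345"},
    {"id": 2, "surname": "Smyth",  "zip": "12399"},
    {"id": 3, "surname": "Jones",  "zip": "54321"},
    {"id": 4, "surname": "Smitth", "zip": "12340"},
]

# The shuffle/sort step of MapReduce is simulated by sorting on the key.
keyed = sorted((map_phase(r) for r in records), key=lambda kv: kv[0])
candidates = list(reduce_phase([r for _, r in keyed]))
for a, b in candidates:
    print(a["id"], b["id"])
```

In a real MapReduce deployment the sorted records are split across reducers, so candidate pairs that straddle a partition boundary would be missed without extra coordination; closing that gap is exactly what the paper's multi-job and data-replication variants are about.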
Similar resources
On the Complexity of Sorted Neighborhood
Record linkage concerns identifying semantically equivalent records in databases. Blocking methods are employed to avoid the cost of full pairwise similarity comparisons on n records. In a seminal work, Hernández and Stolfo proposed the Sorted Neighborhood blocking method. Several empirical variants have been proposed in recent years. In this paper, we investigate the complexity of the Sorted N...
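As a rough illustration of why such blocking matters, the snippet below compares the n(n-1)/2 comparisons of exhaustive pairwise matching with the (w-1)(n - w/2) window comparisons commonly attributed to Sorted Neighborhood for a window of size w, ignoring the sorting cost; the concrete values of n and w are arbitrary choices for the example.

```python
# Rough candidate-pair counts: exhaustive pairwise matching vs. the
# sliding-window comparisons of Sorted Neighborhood (window size w).
# The SN count uses the usual (w - 1) * (n - w / 2) approximation and
# ignores the O(n log n) sorting step.

def full_pairwise(n: int) -> int:
    return n * (n - 1) // 2

def sorted_neighborhood(n: int, w: int) -> int:
    return int((w - 1) * (n - w / 2))

for n in (10_000, 100_000, 1_000_000):
    print(f"n={n:>9}: full={full_pairwise(n):>14}  SN(w=10)={sorted_neighborhood(n, 10):>10}")
```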
Parallel Heuristics for TSP on MapReduce
We analyze the possibility of parallelizing the Traveling Salesman Problem over the MapReduce architecture. We present serial and parallel versions of two algorithms, Tabu Search and Large Neighborhood Search. We compare the best tour length achieved by the serial version versus the best achieved by the MapReduce version. We show that Tabu Search and Large Neighborhood Search are not well su...
Sorted Neighborhood for Schema-free RDF Data
Entity Resolution (ER) concerns identifying pairs of entities that refer to the same underlying entity. To avoid O(n²) pairwise comparison of n entities, blocking methods are used. Sorted Neighborhood is an established blocking method for Relational Databases. It has not been applied to schema-free Resource Description Framework (RDF) data sources widely prevalent in the Linked Data ecosystem. T...
N-Way Heterogeneous Blocking
Record linkage concerns linking records across two tabular datasets. To avoid naive quadratic computation, typical solutions employ a technique called blocking. A blocking scheme partitions records into blocks and generates a candidate set by pairing records within a block. Current models of blocking have been restricted to two homogeneous datasets. The variety aspect of Big Data motiv...
Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming
The objective of this study is to verify the importance of cloud computing services' capabilities for managing and analyzing big data in business organizations. The rapid development in the use of information technology in general, and network technology in particular, has led many organizations to make their applications available for use via electronic platforms hos...